Rows: 57761 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Date, Source, Site ID, UNITS, Site Name, AQS_PARAMETER_DESC, CBSA_...
dbl (9): POC, Daily Mean PM2.5 Concentration, DAILY_AQI_VALUE, DAILY_OBS_CO...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 15976 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Date, Source, Site ID, UNITS, Site Name, AQS_PARAMETER_DESC, CBSA_...
dbl (9): POC, Daily Mean PM2.5 Concentration, DAILY_AQI_VALUE, DAILY_OBS_CO...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
We have 20 column names: “Date”,“Source”, “Site ID”,“POC”, “Daily Mean PM2.5 Concentration”,“UNITS”,“DAILY_AQI_VALUE”,“Site Name” ,“DAILY_OBS_COUNT”, “PERCENT_COMPLETE”, “AQS_PARAMETER_CODE”, “AQS_PARAMETER_DESC”, “CBSA_CODE”, “CBSA_NAME”, “STATE_CODE”, “STATE”, “COUNTY_CODE”, “COUNTY”, “SITE_LATITUDE”, “SITE_LONGITUDE” (Not in this order)
We see that the column names are not in the best form to filter/manipulate the data. I will edit the column names so that it is easier to compute the desired information.
summary(PM2022)
Date Source Site ID POC
Length:57761 Length:57761 Length:57761 Min. : 1.000
Class :character Class :character Class :character 1st Qu.: 1.000
Mode :character Mode :character Mode :character Median : 3.000
Mean : 2.531
3rd Qu.: 3.000
Max. :21.000
Daily Mean PM2.5 Concentration UNITS DAILY_AQI_VALUE
Min. : -2.200 Length:57761 Min. : 0.00
1st Qu.: 4.200 Class :character 1st Qu.: 18.00
Median : 7.000 Mode :character Median : 29.00
Mean : 8.564 Mean : 32.94
3rd Qu.: 10.900 3rd Qu.: 45.00
Max. :302.500 Max. :353.00
Site Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
Length:57761 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88196
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88101
Max. :1 Max. :100 Max. :88502
AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME STATE_CODE
Length:57761 Min. :12540 Length:57761 Length:57761
Class :character 1st Qu.:31080 Class :character Class :character
Mode :character Median :40140 Mode :character Mode :character
Mean :35445
3rd Qu.:41860
Max. :49700
NA's :4761
STATE COUNTY_CODE COUNTY SITE_LATITUDE
Length:57761 Length:57761 Length:57761 Min. :32.58
Class :character Class :character Class :character 1st Qu.:34.14
Mode :character Mode :character Mode :character Median :36.60
Mean :36.37
3rd Qu.:38.10
Max. :41.76
SITE_LONGITUDE
Min. :-124.2
1st Qu.:-121.5
Median :-119.8
Mean :-119.7
3rd Qu.:-118.1
Max. :-115.5
After looking at this, I saw something weird about the data. Daily Mean PM2.5 Concentration’s minimum is -2.200. It doesn’t make sense that the concentration is less than 0. I will explore this later.
Date Source Site ID POC
Length:15976 Length:15976 Length:15976 Min. :1.000
Class :character Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Mode :character Median :1.000
Mean :1.581
3rd Qu.:1.000
Max. :6.000
Daily Mean PM2.5 Concentration UNITS DAILY_AQI_VALUE
Min. : 0.00 Length:15976 Min. : 0.00
1st Qu.: 7.00 Class :character 1st Qu.: 29.00
Median : 12.00 Mode :character Median : 50.00
Mean : 16.12 Mean : 53.68
3rd Qu.: 20.50 3rd Qu.: 69.00
Max. :104.30 Max. :176.00
Site Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
Length:15976 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88215
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88502
Max. :1 Max. :100 Max. :88502
AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME STATE_CODE
Length:15976 Min. :12540 Length:15976 Length:15976
Class :character 1st Qu.:23420 Class :character Class :character
Mode :character Median :40140 Mode :character Mode :character
Mean :33270
3rd Qu.:41740
Max. :49700
NA's :929
STATE COUNTY_CODE COUNTY SITE_LATITUDE
Length:15976 Length:15976 Length:15976 Min. :32.63
Class :character Class :character Class :character 1st Qu.:34.07
Mode :character Mode :character Mode :character Median :35.36
Mean :36.00
3rd Qu.:37.77
Max. :41.71
SITE_LONGITUDE
Min. :-124.2
1st Qu.:-121.4
Median :-119.1
Mean :-119.4
3rd Qu.:-117.9
Max. :-115.5
Unlike 2020 data, there minimum for Daily Mean PM2.5 concentration is 0, but the max is 104.3. The max of the 202 data was 302.500. We may have to look into it.
Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
bothyears <-rbind(PM2002, PM2022)#Separated Datesbothyears1 <-data.frame(date = bothyears$Date)year_dataframe<-separate(bothyears1 , "date", c("Month", "Day", "Year"), sep ="/")#Combined the data in which the year column has a long namesemifinal_dataset<-cbind(bothyears,year_dataframe$Year,year_dataframe$Month)#Change all the column names in the data framefinal_dataset <-semifinal_dataset %>%rename("YEAR"="year_dataframe$Year","Daily_Mean_PM2.5_Concentration"="Daily Mean PM2.5 Concentration","Site_Name"="Site Name" , "Month"="year_dataframe$Month" )head(final_dataset)
Date Source Site ID POC Daily_Mean_PM2.5_Concentration UNITS
1 01/05/2002 AQS 060010007 1 25.1 ug/m3 LC
2 01/06/2002 AQS 060010007 1 31.6 ug/m3 LC
3 01/08/2002 AQS 060010007 1 21.4 ug/m3 LC
4 01/11/2002 AQS 060010007 1 25.9 ug/m3 LC
5 01/14/2002 AQS 060010007 1 34.5 ug/m3 LC
6 01/17/2002 AQS 060010007 1 41.0 ug/m3 LC
DAILY_AQI_VALUE Site_Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
1 78 Livermore 1 100 88101
2 92 Livermore 1 100 88101
3 71 Livermore 1 100 88101
4 80 Livermore 1 100 88101
5 98 Livermore 1 100 88101
6 115 Livermore 1 100 88101
AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME
1 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
2 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
3 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
4 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
5 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
6 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
STATE_CODE STATE COUNTY_CODE COUNTY SITE_LATITUDE SITE_LONGITUDE YEAR
1 06 California 001 Alameda 37.68753 -121.7842 2002
2 06 California 001 Alameda 37.68753 -121.7842 2002
3 06 California 001 Alameda 37.68753 -121.7842 2002
4 06 California 001 Alameda 37.68753 -121.7842 2002
5 06 California 001 Alameda 37.68753 -121.7842 2002
6 06 California 001 Alameda 37.68753 -121.7842 2002
Month
1 01
2 01
3 01
4 01
5 01
6 01
Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
#Color Pallete color_palette <-colorFactor(palette =c( "green", "orange"), # Specify colors for each yeardomain = final_dataset$YEAR # Use the YEAR column from your dataset)leaflet(final_dataset) %>%addProviderTiles('OpenStreetMap') %>%addCircles(lat =~SITE_LATITUDE,lng =~SITE_LONGITUDE,opacity =1,fillOpacity =1,radius =100,color =~color_palette(YEAR), # Color circles based on the YEAR columnpopup =~Site_Name # Display SITE_NAME in the popup )%>%addLegend(pal = color_palette,values =~YEAR,title ="Year",position ="bottomright")
The dots are suppose to represent each (data station) data point, but in the exploratory data analysis, we saw that there was more data 2002 and 2020 data set, but that is not reflective on the map. We would expect more dots for 2022, and even more for 2002.The data is limited to just California.
Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
#Finding the missing values: There were no missing vlaues missing_pm <- final_dataset %>%filter(is.na(Daily_Mean_PM2.5_Concentration))max(final_dataset$Daily_Mean_PM2.5_Concentration)
[1] 302.5
min(final_dataset$Daily_Mean_PM2.5_Concentration)
[1] -2.2
#I filted out to see how many negative values there are and other information. negative_PM2.5<- final_dataset %>%filter(Daily_Mean_PM2.5_Concentration<0)dim(negative_PM2.5) #218 values that are less than 0
[1] 218 22
head(negative_PM2.5)
Date Source Site ID POC Daily_Mean_PM2.5_Concentration UNITS
1 07/06/2022 AQS 060010011 3 -0.7 ug/m3 LC
2 07/30/2022 AQS 060010011 3 -0.1 ug/m3 LC
3 08/26/2022 AQS 060010011 3 -0.5 ug/m3 LC
4 02/01/2022 AQS 060072002 3 -0.3 ug/m3 LC
5 02/06/2022 AQS 060072002 3 -0.1 ug/m3 LC
6 02/22/2022 AQS 060072002 3 -0.3 ug/m3 LC
DAILY_AQI_VALUE Site_Name DAILY_OBS_COUNT PERCENT_COMPLETE
1 0 Oakland West 1 100
2 0 Oakland West 1 100
3 0 Oakland West 1 100
4 0 Paradise - Theater 1 100
5 0 Paradise - Theater 1 100
6 0 Paradise - Theater 1 100
AQS_PARAMETER_CODE AQS_PARAMETER_DESC CBSA_CODE
1 88101 PM2.5 - Local Conditions 41860
2 88101 PM2.5 - Local Conditions 41860
3 88101 PM2.5 - Local Conditions 41860
4 88502 Acceptable PM2.5 AQI & Speciation Mass 17020
5 88502 Acceptable PM2.5 AQI & Speciation Mass 17020
6 88502 Acceptable PM2.5 AQI & Speciation Mass 17020
CBSA_NAME STATE_CODE STATE COUNTY_CODE COUNTY
1 San Francisco-Oakland-Hayward, CA 06 California 001 Alameda
2 San Francisco-Oakland-Hayward, CA 06 California 001 Alameda
3 San Francisco-Oakland-Hayward, CA 06 California 001 Alameda
4 Chico, CA 06 California 007 Butte
5 Chico, CA 06 California 007 Butte
6 Chico, CA 06 California 007 Butte
SITE_LATITUDE SITE_LONGITUDE YEAR Month
1 37.81478 -122.2823 2022 07
2 37.81478 -122.2823 2022 07
3 37.81478 -122.2823 2022 08
4 39.77919 -121.5914 2022 02
5 39.77919 -121.5914 2022 02
6 39.77919 -121.5914 2022 02
# I aggregated county data to find the summary statisticfinal_dataset1 <- final_dataset %>%group_by(COUNTY)%>%summarize(Mean_Variable =mean(Daily_Mean_PM2.5_Concentration),Median_Variable =median(Daily_Mean_PM2.5_Concentration),Min_Var=min(Daily_Mean_PM2.5_Concentration),Max_Var=max(Daily_Mean_PM2.5_Concentration),Count =n() )head(final_dataset1)
#I aggregated site data to find the summary statisticfinal_dataset2 <- final_dataset %>%group_by(Site_Name)%>%summarize(Mean_Variable =mean(Daily_Mean_PM2.5_Concentration),Median_Variable =median(Daily_Mean_PM2.5_Concentration),Min_Var=min(Daily_Mean_PM2.5_Concentration),Max_Var=max(Daily_Mean_PM2.5_Concentration),Count =n() )head(final_dataset2)
There are no N/A values in Daily Mean PM2.5 Concentration. The minimum is of this column is -2.2 and the maximum is 302.5. Like mentioned before, it is impossible to have a concentration level less than 0. The value 302.5 makes sense; it would be considered hazardous. In this data set, we have to keep acount that 218 rows in the data set that have a Daily Mean PM2.5 Concentration are less than 0. I then computed the mean, median,min, count and max for each county in finaldataset1. In finaldaraset2, I computed the same information except it is for each site in the data set.
Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
state
county
site in Los Angeles
STATE ANALYSIS: (Original Data Since I Filtered It
#Box Plot ggplot(final_dataset, aes(x = YEAR, y =Daily_Mean_PM2.5_Concentration )) +geom_boxplot() +labs(title ="Box Plot of Daily Mean PM2.5 Concentration by Year in California", x ="Year", y ="PM2.5")
# A tibble: 2 × 4
YEAR Q25 Median Q75
<chr> <dbl> <dbl> <dbl>
1 2002 7 12 20.5
2 2022 4.2 7 10.9
Here, I created the box and whiskers plot to assess the difference in Daily Mean PM2.5 Concentration between two underlying groups (2002 versus 2022). I then tabled the IQR. I can see there are more outliers in 2022 then there is in 2002. There is more of a spread in 2002 compared to 2022.
#Histogram of the same data ggplot(final_dataset, aes(x =log(Daily_Mean_PM2.5_Concentration), fill=YEAR)) +geom_histogram(binwidth =0.50) +labs(title ="Histogram of PM2.5 Values in California by Year",x ="PM2.5",y ="Frequency")
Warning in log(Daily_Mean_PM2.5_Concentration): NaNs produced
Warning in log(Daily_Mean_PM2.5_Concentration): NaNs produced
In addition to the box and whiskers plot, I made a histogram to show the same information, but in a distribution, I decided to take the log of the x values to transform the data to see the difference in a more digestible way. We can see that the data in 2002 is more spreaded out.
Date Source Site ID POC Daily_Mean_PM2.5_Concentration UNITS
1 01/05/2002 AQS 060010007 1 25.1 ug/m3 LC
2 01/06/2002 AQS 060010007 1 31.6 ug/m3 LC
3 01/08/2002 AQS 060010007 1 21.4 ug/m3 LC
4 01/11/2002 AQS 060010007 1 25.9 ug/m3 LC
5 01/14/2002 AQS 060010007 1 34.5 ug/m3 LC
6 01/17/2002 AQS 060010007 1 41.0 ug/m3 LC
DAILY_AQI_VALUE Site_Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
1 78 Livermore 1 100 88101
2 92 Livermore 1 100 88101
3 71 Livermore 1 100 88101
4 80 Livermore 1 100 88101
5 98 Livermore 1 100 88101
6 115 Livermore 1 100 88101
AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME
1 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
2 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
3 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
4 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
5 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
6 PM2.5 - Local Conditions 41860 San Francisco-Oakland-Hayward, CA
STATE_CODE STATE COUNTY_CODE COUNTY SITE_LATITUDE SITE_LONGITUDE YEAR
1 06 California 001 Alameda 37.68753 -121.7842 2002
2 06 California 001 Alameda 37.68753 -121.7842 2002
3 06 California 001 Alameda 37.68753 -121.7842 2002
4 06 California 001 Alameda 37.68753 -121.7842 2002
5 06 California 001 Alameda 37.68753 -121.7842 2002
6 06 California 001 Alameda 37.68753 -121.7842 2002
Month
1 01
2 01
3 01
4 01
5 01
6 01
# Bar plot for Mean_Variable by COUNTYbar_plot <-ggplot(final_dataset1, aes(x = COUNTY, y = Mean_Variable)) +geom_bar(stat ="identity", fill ="skyblue") +labs(x ="COUNTY", y ="Mean Daily_Mean_PM2.5_Concentration") +ggtitle("Mean Daily PM2.5 Concentration by COUNTY") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))print(bar_plot)
There are 51 counties in this data set, but California actually has 58. In the exploratory data analysis, this information can help with identifying why other counties were not excluded. From the bar plit, we can compare the mean daily PM2.5 concentration in rach county. We can see that the highest concentration amount is Kern and Kinds. We can see the overall trend for each county. However, I do think that there are better wats to visualize it. I think that becuase there are a lot of counties, it can be hard to visualize it. I think a data table with values would be the best way to see which is the highest and which is the lowest.
Los Angeles County Exploration :
#Los Angeles County los_angeles_dataset <-final_dataset %>%filter(COUNTY =="Los Angeles")#unique(los_angeles_dataset[,3])#By County los_angeles_dataset1 <- los_angeles_dataset %>%group_by(Site_Name)%>%summarize(Mean_Variable =mean(Daily_Mean_PM2.5_Concentration),Median_Variable =median(Daily_Mean_PM2.5_Concentration),Min_Var=min(Daily_Mean_PM2.5_Concentration),Max_Var=max(Daily_Mean_PM2.5_Concentration),Count =n())head(los_angeles_dataset1)
# Box plot by yearboxplot(Daily_Mean_PM2.5_Concentration ~ YEAR, data = los_angeles_dataset, main ="Daily Mean PM2.5 Concentrations in Los Angeles, grouped by year", xlab ="County", ylab ="PM2.5 Concentration")
# Bar plot for Mean_Variable by Site_Namebar_plot <-ggplot(los_angeles_dataset1, aes(x = Site_Name, y = Mean_Variable)) +geom_bar(stat ="identity", fill ="skyblue") +labs(x ="Site Name", y ="Mean Daily_Mean_PM2.5_Concentration") +ggtitle("Mean Daily PM2.5 Concentration by Site Name") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))print(bar_plot)
# Box plot for Daily_Mean_PM2.5_Concentration by Site_Namebox_plot <-ggplot(los_angeles_dataset1, aes(x = Site_Name, y = Mean_Variable)) +geom_boxplot(fill ="lightgreen", color ="darkgreen") +labs(x ="Site Name", y ="Mean_Variable") +ggtitle("Distribution of Daily PM2.5 Concentration by Site Name") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))print(box_plot)
For Los Angeles, I made three different visualizations that give about the same information. We can see from the box in whiskers plots that the mean daily PM2.5 concentration in Los Angeles was higher in 2002 compared to 2022. With this information, we hypothesize why there is a difference between mean values between those two years (i.e Less cars? More electric cars? etc.) We can see that both data contain outliers. We also know that there spread is highers in 2002 (range of values). In the barplots, we can see that NA exists. If I were to data clean even more, I would not include the data where the PM2.5 concentrations are not inputted. We can see that Burbank, Lynwood, and NA (we will not mention this) have the highest mean concentration level of PM2.5. We can see that Lebec has the lowest. The same information can be found in the last plot (with green tics). However, I think the blue box plot is a lot better to read and is more visually pleasing.